METHOD AND APPARATUS FOR
DETECTING VOICE ACTIVITY IN A SPEECH SIGNAL 

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to the field of speech coding in
communication systems, and more particularly to detecting voice activity in a
communications system.

2. Description of Related Art

Modern communication systems rely heavily on digital speech
processing in general, and digital speech compression in particular, in order to provide
efficient systems. Examples of such communication systems are digital telephony
trunks, voice mail, voice annotation, answering machines, digital voice over data
links, etc.

A speech communication system is typically comprised of an encoder, 
a communication channel and a decoder. At one end of a communications link, the 
speech encoder converts a speech signal which has been digitized into a bit-stream. 
The bit-stream is transmitted over the communication channel (which can be a storage 
medium), and is converted again into a digitized speech signal by the decoder at the
other end of the communications link. 

The ratio between the number of bits needed for the representation of 
the digitized speech signal and the number of bits in the bit-stream is the compression 
ratio. A compression ratio of 12 to 16 is presently achievable, while still maintaining
a high quality reconstructed speech signal.

A significant portion of normal speech is comprised of silence, up to an 
average of 60% during a two-way conversation. During silence, the speech input 
device, such as a microphone, picks up the environment or background noise. The
noise level and characteristics can vary considerably, from a quiet room to a noisy
street or a fast moving car. However, most of the noise sources carry less information
than the speech signal and hence a higher compression ratio is achievable during the
silence periods. In the following description, speech will be denoted as "active-voice"
and silence or background noise will be denoted as "non-active-voice."

The above discussion leads to the concept of dual-mode speech coding 
schemes, which are usually also variable-rate coding schemes. The active-voice and 
the non-active voice signals are coded differently in order to improve the system 
efficiency, thus providing two different modes of speech coding. The different modes
of the input signal (active-voice or non-active-voice) are determined by a signal
classifier which can operate external to, or within, the speech encoder. The coding
scheme employed for the non-active-voice signal uses fewer bits and results in an
overall higher average compression ratio than the coding scheme employed for the
active-voice signal. The classifier output is binary, and is commonly called a "voicing
decision." The classifier is also commonly referred to as a Voice Activity Detector
("VAD").

A schematic representation of a speech communication system which 
employs a VAD for a higher compression rate is depicted in Figure 1. The input to the
speech encoder 110 is the digitized incoming speech signal 105. For each frame of a
digitized incoming speech signal the VAD 125 provides the voicing decision 140, which
is used as a switch 145 between the active-voice encoder 120 and the non-active-voice
encoder 115. Either the active-voice bit-stream 135 or the non-active-voice bit-stream
130, together with the voicing decision 140, is transmitted through the communication
channel 150. At the speech decoder 155 the voicing decision is used in the switch 160
to select the non-active-voice decoder 165 or the active-voice decoder 170. For each
frame, the output of either decoder is used as the reconstructed speech 175.
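
The per-frame switching described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `encode_frame`, `toy_vad`, `toy_active_encoder` and `toy_inactive_encoder` are hypothetical, and the toy encoders merely return placeholder bytes rather than a real bit-stream.

```python
from typing import Callable, List, Tuple

Frame = List[float]


def encode_frame(
    frame: Frame,
    vad: Callable[[Frame], int],
    active_encoder: Callable[[Frame], bytes],
    inactive_encoder: Callable[[Frame], bytes],
) -> Tuple[int, bytes]:
    """Route one frame to one of two encoders according to the voicing decision.

    Returns the binary voicing decision together with the selected bit-stream,
    since both are sent over the channel.
    """
    decision = vad(frame)  # 1 = active-voice, 0 = non-active-voice
    encoder = active_encoder if decision == 1 else inactive_encoder
    return decision, encoder(frame)


# Hypothetical stand-ins, used only to make the sketch runnable.
def toy_vad(frame: Frame) -> int:
    return 1 if sum(x * x for x in frame) > 1.0 else 0


def toy_active_encoder(frame: Frame) -> bytes:
    return b"A" * len(frame)  # placeholder for the higher-rate bit-stream


def toy_inactive_encoder(frame: Frame) -> bytes:
    return b"N"  # placeholder for the lower-rate bit-stream
```

A decoder would mirror this by reading the transmitted decision and selecting the matching decoder for each frame.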

An example of a method and apparatus which employs such a dual-mode 
system is disclosed in U.S. Patent No. 5,774,849, commonly assigned to the present
assignee and herein incorporated by reference. According to U.S. Patent No. 5,774,849,
four parameters are disclosed which may be used to make the voicing decision.
Specifically, the full band energy, the frame low-band energy, a set of parameters called
Line Spectral Frequencies ("LSF") and the frame zero crossing rate are compared to a
long-term average of the noise signal. While this algorithm provides satisfactory results
for many applications, the present inventors have determined that a modified decision
algorithm can provide improved performance over the prior art voicing decision
algorithms.

SUMMARY OF THE INVENTION

A method and apparatus are provided for generating frame voicing decisions for an
incoming speech signal having periods of active voice and non-active voice for a
speech encoder in a speech communications system. A predetermined set of
parameters is extracted from the incoming speech signal, including a pitch gain and a
pitch lag. A frame voicing decision is made for each frame of the incoming speech
signal according to values calculated from the extracted parameters. The
predetermined set of parameters further includes a frame full band energy, and a set
of spectral parameters called Line Spectral Frequencies (LSF).

BRIEF DESCRIPTION OF THE DRAWINGS 
The exact nature of this invention, as well as its objects and 
advantages, will become readily apparent from consideration of the following 
specification as illustrated in the accompanying drawings, in which like reference 
numerals designate like parts throughout the figures thereof, and wherein: 

Figure 1 is a block diagram representation of a speech communication
system using a VAD;

Figures 2(A) and 2(B) are process flowcharts illustrating the
operation of the VAD in accordance with the present invention; and 

Figure 3 is a block diagram illustrating one embodiment of a VAD 
according to the present invention. 

DETAILED DESCRIPTION
OF THE PREFERRED EMBODIMENTS

The following description is provided to enable any person skilled in
the art to make and use the invention and sets forth the best modes contemplated by
the inventor for carrying out the invention. Various modifications, however, will
remain readily apparent to those skilled in the art, since the basic principles of the 
present invention have been defined herein specifically to provide a voice activity 
detection method and apparatus. 

In the following description, the present invention is described in terms 
of functional block diagrams and process flow charts, which are the ordinary means 
for those skilled in the art of speech coding for describing the operation of a VAD. 
The present invention is not limited to any specific programming languages, or any
specific hardware or software implementation, since those skilled in the art can readily
determine the most suitable way of implementing the teachings of the present
invention.

In the preferred embodiment, a Voice Activity Detection (VAD) 
module is used to generate a voicing decision which switches between an active-voice 
encoder/decoder and a non-active-voice encoder/decoder. The binary voicing 
decision is either 1 (TRUE) for the active-voice or 0 (FALSE) for the non-active-voice.

The VAD process flowchart is illustrated in Figures 2(A) and 2(B). 
The VAD operates on frames of digitized speech. The frames are processed in time
order and are consecutively numbered from the beginning of each 
conversation/recording. The illustrated process is performed once per frame. 

At the first block 200, four parametric features are extracted from the
input signal. Extraction of the parameters can be shared with the active-voice encoder
module 120 and the non-active-voice encoder module 115 for computational
efficiency. The parameters are the frame full band energy, a set of spectral parameters
called Line Spectral Frequencies ("LSF"), the pitch gain and the pitch lag.

WO 00/17856    PCT/US99/19806

A set of linear prediction coefficients is derived from the autocorrelation, and a set of
LSFs $\{LSF_i\}_{i=1}^{p}$ is derived from the set of linear prediction coefficients, as described in
ITU-T, Study Group 15 Contribution - Q. 12/15, Draft Recommendation G.729, June
8, 1995, Version 5.0, or DIGITAL SPEECH - Coding for Low Bit Rate
Communication Systems by A.M. Kondoz, John Wiley & Son, 1994, England. The
full band energy E is the logarithm of the normalized first autocorrelation coefficient
R(0):



$$E = 10 \cdot \log_{10}\left[\frac{1}{N} R(0)\right]$$

where N is a predetermined normalization factor.
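
As a minimal sketch of the energy computation above, the following Python function evaluates the zero-lag autocorrelation of a frame and converts the normalized value to decibels. The function names, the `norm_factor` argument and the small `floor` used to avoid taking the logarithm of zero are assumptions made for this illustration and are not specified in the text.

```python
import math
from typing import Sequence


def autocorrelation_zero_lag(frame: Sequence[float]) -> float:
    """R(0): the sum of squared samples of the frame."""
    return sum(x * x for x in frame)


def full_band_energy_db(
    frame: Sequence[float], norm_factor: float, floor: float = 1e-10
) -> float:
    """E = 10 * log10(R(0) / N), with N supplied as norm_factor.

    The floor is an assumption of this sketch; it only guards against
    log10(0) for an all-zero frame.
    """
    r0 = autocorrelation_zero_lag(frame)
    return 10.0 * math.log10(max(r0 / norm_factor, floor))
```

For example, a frame of four samples of amplitude 10 with a normalization factor of 4 gives a normalized R(0) of 100, and therefore an energy of 20 dB.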

The pitch gain is a measure of the periodicity of the input signal. The higher the pitch 
gain, the more periodic the signal, and therefore the greater the likelihood that the 
signal is a speech signal. The pitch lag corresponds to the period of the fundamental
frequency of the speech (active-voice) signal.

After the parameters are extracted, the standard deviation σ of the pitch
lags of the last four previous frames is computed at block 205. The long-term mean
of the pitch gain is updated with the average of the pitch gain from the last four
frames at block 210. In the preferred embodiment, the long-term mean of the pitch
gain is calculated according to the following formula:

$$P_{gain} = 0.8 \cdot P_{gain} + 0.2 \cdot [\text{average of last four frames}]$$
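
The two pitch statistics of blocks 205 and 210 might be computed as in the following Python sketch. The function names are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not say which is intended.

```python
import statistics
from typing import Sequence


def pitch_lag_std(last_four_lags: Sequence[float]) -> float:
    """Standard deviation of the pitch lags of the last four frames.

    Population form (divide by the number of values) is assumed here.
    """
    return statistics.pstdev(last_four_lags)


def update_long_term_pitch_gain(
    long_term_mean: float, last_four_gains: Sequence[float]
) -> float:
    """Pgain = 0.8 * Pgain + 0.2 * (average of last four frames)."""
    recent_average = sum(last_four_gains) / len(last_four_gains)
    return 0.8 * long_term_mean + 0.2 * recent_average
```

A steady pitch lag (for example 40, 40, 40, 40) yields a standard deviation of zero, which is the kind of regularity expected of voiced speech.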



The short-term average of energy, Es, is updated at block 215 by
averaging the last three frames with the current frame energy. Similarly, the short-term
average of LSF vectors, LSFs, is updated at block 220 by averaging the last
three LSF frame vectors with the current LSF frame vector extracted by the parameter
extractor at block 200. If the standard deviation σ is less than T1 or the long-term
mean of the pitch gain is greater than T2, then a flag Pflag is set to one, otherwise Pflag
equals zero at block 225:

If σ < T1 OR Pgain > T2, then Pflag = 1, else Pflag = 0.

In the preferred embodiment, T1 = 1.2 and T2 = 0.7. At block 230, a minimum energy
buffer is updated with the minimum energy value over the last 128 frames. In other
words, if the present energy level is less than the minimum energy level determined
over the last 128 frames, then the value of the buffer is updated, otherwise the buffer
value is unchanged.
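
The pitch flag and the minimum energy buffer can be illustrated with the short Python sketch below. The threshold values are those of the preferred embodiment; the names `pitch_flag` and `MinEnergyBuffer` are hypothetical, and treating the buffer as a sliding window over the most recent frames is one possible reading of the description rather than a confirmed detail.

```python
from collections import deque

T1 = 1.2  # threshold on the pitch-lag standard deviation
T2 = 0.7  # threshold on the long-term mean of the pitch gain


def pitch_flag(lag_std: float, long_term_pitch_gain: float,
               t1: float = T1, t2: float = T2) -> int:
    """Pflag = 1 if the lag is steady or the pitch gain is high, else 0."""
    return 1 if (lag_std < t1 or long_term_pitch_gain > t2) else 0


class MinEnergyBuffer:
    """Tracks the minimum frame energy over the last `window` frames.

    A sliding window is assumed here; frames older than the window
    no longer contribute to the minimum.
    """

    def __init__(self, window: int = 128) -> None:
        self._energies = deque(maxlen=window)

    def update(self, energy_db: float) -> float:
        self._energies.append(energy_db)
        return min(self._energies)
```

With a window of 3, feeding energies 5, 3, 4, 6, 7 produces minima 5, 3, 3, 3, 4: the value 3 drops out of the window on the last update.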

If the frame count (i.e. current frame number) is less than a
predetermined frame count Ni at block 235, where Ni is 32 in the preferred
embodiment, an initialization routine is performed by blocks 240-255. At block 240
the average energy Ē and the long-term average noise spectrum LSFn are calculated
over the last Ni frames. The average energy Ē is the average of the energy of the last
Ni frames. The initial value for Ē, calculated at block 240, is:

$$\bar{E} = \frac{1}{N_i} \sum_{n=1}^{N_i} E(n)$$

The long-term average noise spectrum LSFn is the average of the LSF
vectors of the last Ni frames. At block 245, if the instantaneous energy E extracted at
block 200 is less than 15 dB, then the voicing decision is set to zero (block 255),
otherwise the voicing decision is set to one (block 250). The processing for the frame is
then completed and the next frame is processed, beginning with block 200.
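
A rough Python sketch of the initialization routine of blocks 240-255 is given below. It assumes the caller passes the energies and LSF vectors of the frames seen so far (at most the first 32), with the current frame last; the function name and argument layout are hypothetical.

```python
from typing import List, Sequence, Tuple


def initialization_step(
    frame_energies_db: Sequence[float],
    lsf_vectors: Sequence[Sequence[float]],
    energy_floor_db: float = 15.0,
) -> Tuple[float, List[float], int]:
    """Return (average energy, average LSF vector, voicing decision).

    The averages run over all frames supplied; the decision depends only
    on the energy of the current (last) frame.
    """
    n = len(frame_energies_db)
    avg_energy = sum(frame_energies_db) / n
    order = len(lsf_vectors[0])
    avg_lsf = [sum(v[i] for v in lsf_vectors) / n for i in range(order)]
    decision = 0 if frame_energies_db[-1] < energy_floor_db else 1
    return avg_energy, avg_lsf, decision
```

During these first frames the decision is thus a plain energy threshold, while the averages accumulate the values needed once the main decision logic takes over.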

The initialization processing of blocks 240-255 initializes the
processing over the last few frames. It is not critical to the operation of the present
invention and may be skipped. The calculations of block 240 are required, however,
for the proper operation of the invention and should be performed, even if the voicing
decisions of blocks 245-255 are skipped. Also, during initialization, the voicing 
decision could always be set to "1" without significantly impacting the performance of 
the present invention. 

If the frame count is not less than Ni at block 235, then the first time 
through block 260 (Frame_Count = Ni ), the long-term average noise energy En is 
initialized by subtracting 12 dB from the average energy Ē:

$$E_n = \bar{E} - 12\ \mathrm{dB}$$

Next, at block 265, a spectral difference value SD1 is calculated using
the normalized Itakura-Saito measure. The value SD1 is a measure of the difference
between two spectra (the current frame spectra represented by R and E_err, and the
background noise spectrum represented by a). The Itakura-Saito measure is a well-known
algorithm in the speech processing art and is described in detail, for example,
in Discrete-Time Processing of Speech Signals, Deller, John R., Proakis, John G. and
Hansen, John H.L., 1987, pages 327-329, herein incorporated by reference.
Specifically, SD1 is defined by the following equation:

$$SD_1 = \frac{\vec{a}^{T} R\, \vec{a}}{E_{err}}$$

where E_err is the prediction error from linear prediction (LP) analysis of
the current frame;

R is the auto-correlation matrix from the LP analysis of the current
frame; and

a is a linear prediction filter describing the background noise.
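
To make the quadratic form in SD1 concrete, the following Python sketch computes it from an autocorrelation sequence. It rests on several assumptions not stated in the text: the noise filter is given as the coefficient vector [1, a1, ..., ap], the autocorrelation matrix is the symmetric Toeplitz matrix built from lags 0 to p, and the function names are hypothetical.

```python
from typing import List, Sequence


def toeplitz_autocorrelation(autocorr: Sequence[float]) -> List[List[float]]:
    """Symmetric Toeplitz matrix with entry (i, j) equal to r[|i - j|]."""
    n = len(autocorr)
    return [[autocorr[abs(i - j)] for j in range(n)] for i in range(n)]


def spectral_difference_sd1(
    noise_lp_filter: Sequence[float],
    autocorr: Sequence[float],
    prediction_error: float,
) -> float:
    """SD1 = (a^T R a) / E_err for the current frame.

    noise_lp_filter and autocorr must have the same length (p + 1).
    """
    if len(noise_lp_filter) != len(autocorr):
        raise ValueError("filter and autocorrelation lengths must match")
    r = toeplitz_autocorrelation(autocorr)
    n = len(noise_lp_filter)
    quad = sum(
        noise_lp_filter[i] * r[i][j] * noise_lp_filter[j]
        for i in range(n)
        for j in range(n)
    )
    return quad / prediction_error
```

With autocorrelation lags [2, 1], the optimal first-order predictor is [1, -0.5] with prediction error 1.5, and the sketch returns 1.0 when that same filter is passed as the noise filter. Any other filter gives a larger value, which is why a value well above 1 suggests the frame differs spectrally from the background noise.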

At block 270 the spectral differences SD2 and SD3 are calculated using
a mean square error method according to the following equations:

$$SD_2 = \sum_{i=1}^{p} \left[ LSF_s(i) - LSF_n(i) \right]^2$$

$$SD_3 = \sum_{i=1}^{p} \left[ LSF_s(i) - LSF(i) \right]^2$$

where LSFs is the short-term average of LSF;
LSFn is the long-term average noise spectrum; and
LSF is the current LSF extracted by the parameter extraction.
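
Both SD2 and SD3 are sums of squared differences between two LSF vectors, so a single helper suffices in a sketch. The function name `squared_lsf_distance` is hypothetical.

```python
from typing import Sequence


def squared_lsf_distance(lsf_a: Sequence[float], lsf_b: Sequence[float]) -> float:
    """Sum over i of (lsf_a[i] - lsf_b[i])^2 for two equal-length LSF vectors."""
    if len(lsf_a) != len(lsf_b):
        raise ValueError("LSF vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(lsf_a, lsf_b))


# SD2 compares the short-term average with the long-term noise spectrum;
# SD3 compares the short-term average with the current frame's LSF vector.
def sd2(lsf_short: Sequence[float], lsf_noise: Sequence[float]) -> float:
    return squared_lsf_distance(lsf_short, lsf_noise)


def sd3(lsf_short: Sequence[float], lsf_current: Sequence[float]) -> float:
    return squared_lsf_distance(lsf_short, lsf_current)
```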



The long-term mean of SD2 (sm_SD2) in the preferred embodiment is
updated at block 275 according to the following equation:

sm_SD2 = 0.4 * SD2 + 0.6 * sm_SD2

Thus, the long-term mean of SD2 is a linear combination of the past long-term mean
and the current SD2 value.

The initial voicing decision, obtained in block 280, is denoted by Ivd.
The value of Ivd is determined according to the following decision statements:

If Es > En + X1 dB
OR
E > En + X2 dB
then Ivd = 1;

If Es - En < X3 dB
AND
sm_SD2 < T3
AND
Frame_Count > 128
then Ivd = 0; else Ivd = 1;

If E > 1/2 (E^-1 + E^-2) + X4 dB
OR
SD1 > 1.5
then Ivd = 1.

In the preferred embodiment, X1 = 1, X2 = 3, X3 = 2, X4 = 7, and T3 = 0.00012.
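
One way to read the decision statements above is to apply them strictly in the order written, with later statements overriding earlier ones. The Python sketch below follows that reading; it is an assumption, not a confirmed interpretation, and under it the second statement assigns a value in both branches, so the first statement has no lasting effect. The function and argument names are hypothetical, and the two previous-frame energies are passed in explicitly.

```python
def initial_voicing_decision(
    e: float,            # instantaneous energy E of the current frame (dB)
    e_short: float,      # short-term average energy Es (dB)
    e_noise: float,      # long-term average noise energy En (dB)
    e_prev1: float,      # energy of the previous frame (dB)
    e_prev2: float,      # energy of the frame before the previous one (dB)
    sd1: float,
    sm_sd2: float,
    frame_count: int,
    x1: float = 1.0,
    x2: float = 3.0,
    x3: float = 2.0,
    x4: float = 7.0,
    t3: float = 0.00012,
) -> int:
    """Apply the three decision statements sequentially (assumed reading)."""
    ivd = 1
    # Statement 1: energy clearly above the noise estimate.
    if e_short > e_noise + x1 or e > e_noise + x2:
        ivd = 1
    # Statement 2: quiet and spectrally close to the noise, after frame 128.
    if e_short - e_noise < x3 and sm_sd2 < t3 and frame_count > 128:
        ivd = 0
    else:
        ivd = 1
    # Statement 3: sudden energy rise or large spectral difference.
    if e > 0.5 * (e_prev1 + e_prev2) + x4 or sd1 > 1.5:
        ivd = 1
    return ivd
```

If the statements were instead meant as a priority chain, in which the first matching statement decides, the result would differ for frames whose short-term energy lies between 1 dB and 2 dB above the noise estimate.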

The initial voicing decision is smoothed at block 285 to reflect the long
term stationary nature of the speech signal. The smoothed voicing decision of the
frame, the previous frame and the frame before the previous frame are denoted by
S_vd^0, S_vd^-1 and S_vd^-2, respectively. Both S_vd^-1 and S_vd^-2 are initialized to 1 and S_vd^0 = Ivd.
A Boolean parameter F_vd^-1 is initialized to 1 and a counter denoted by Ce is initialized
to 0. The energy of the previous frame is denoted by E^-1. Thus, the smoothing stage
is defined by:

    [...] = 1 and Ivd = 0 and S_vd^-1 = 1 and S_vd^-2 = [...]
            Ce = Ce + 1
            [...] = 1
        }
        else {
            F_vd^-1 = 0
            Ce = 0
        }
    }
    F_vd^-1 = 1

Ce is reset to 0 if S_vd^-1 = 1 and S_vd^-2 = 1 and Ivd = 1.

If Pflag = 1, then S_vd^0 = 1

If E < 15 dB, then S_vd^0 = 0

In the preferred embodiment, T4 = 14. The final value of S_vd^0 represents the final
voicing decision, with a value of "1" representing an active voice speech signal, and a
value of "0" representing a non-active voice speech signal.

Fsd is a flag which indicates whether consecutive frames exhibit
spectral stationarity (i.e., spectrum does not change dramatically from frame to
frame). Fsd is set at block 290 according to the following, where Cs is a counter
initialized to 0.

If Frame_Count > 128 AND SD3 < T5
then
    Cs = Cs + 1
else
    Cs = 0;
If Cs > N
    Fsd = 1
else
    Fsd = 0.

In the preferred embodiment, T5 = 0.0005 and N = 20.
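
The stationarity counter can be sketched as a small stateful object in Python. The class name `StationarityTracker` is hypothetical; the default thresholds are those of the preferred embodiment.

```python
class StationarityTracker:
    """Counts consecutive frames with a small SD3 and derives the flag Fsd."""

    def __init__(self, t5: float = 0.0005, n: int = 20) -> None:
        self.t5 = t5
        self.n = n
        self.counter = 0  # Cs, initialized to 0

    def update(self, frame_count: int, sd3: float) -> int:
        """Process one frame and return Fsd (1 = spectrally stationary)."""
        if frame_count > 128 and sd3 < self.t5:
            self.counter += 1
        else:
            self.counter = 0
        return 1 if self.counter > self.n else 0
```

Because the comparison is strict, the flag is first raised on the 21st consecutive qualifying frame, and a single frame with a large SD3 resets the count.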

The running averages of the background noise characteristics are
updated at the last stage of the VAD algorithm. At blocks 295 and 300, the following
conditions are tested and the updating takes place only if these conditions are met:

If Es < En + 3 AND Pflag = 0
then En = β_En * En + (1 - β_En) * [max of E AND B]
AND
LSFn(i) = β_LSF * LSFn(i) + (1 - β_LSF) * LSF(i),  i = 1, ..., p

If Frame_Count > 128 AND
En < Min AND Fsd = 1 AND Pflag = 0
then
    En = Min
else if Frame_Count > 128 AND En > Min + 10
then
    En = Min.
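
The noise update can be sketched in Python as two small functions. Several points are assumptions of this sketch: the smoothing coefficients are not given numerically in the text, so they are required arguments; the bracketed "max" term is passed in by the caller as `e_update` because its second operand is not legible in the text; and `e_min` stands for the value the text calls Min, presumably the content of the minimum energy buffer of block 230. All function and argument names are hypothetical.

```python
from typing import List, Sequence, Tuple


def update_noise_estimates(
    e_noise: float,
    lsf_noise: Sequence[float],
    e_update: float,
    lsf_current: Sequence[float],
    e_short: float,
    p_flag: int,
    beta_e: float,
    beta_lsf: float,
) -> Tuple[float, List[float]]:
    """Update En and LSFn only when the frame looks like background noise."""
    new_lsf = list(lsf_noise)
    if e_short < e_noise + 3.0 and p_flag == 0:
        e_noise = beta_e * e_noise + (1.0 - beta_e) * e_update
        new_lsf = [
            beta_lsf * n + (1.0 - beta_lsf) * c
            for n, c in zip(lsf_noise, lsf_current)
        ]
    return e_noise, new_lsf


def clamp_noise_energy(
    e_noise: float, e_min: float, frame_count: int, f_sd: int, p_flag: int
) -> float:
    """Pull En back to the tracked minimum when it drifts too far from it."""
    if frame_count > 128 and e_noise < e_min and f_sd == 1 and p_flag == 0:
        return e_min
    if frame_count > 128 and e_noise > e_min + 10.0:
        return e_min
    return e_noise
```

The pitch flag gates the update so that frames with steady pitch, which are likely speech, do not leak into the noise estimate.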



Figure 3 illustrates a block diagram of one possible implementation of 
a VAD 400 according to the present invention. An extractor 402 extracts the required 
predetermined parameters, including a pitch lag and a pitch gain, from the incoming 
speech signal 105. A calculator unit 404 performs the necessary calculations on the
extracted parameters, as illustrated by the flowcharts in Figs. 2(A) and 2(B). A
decision unit 406 then determines whether a current speech frame is an active voice or 
a non-active voice signal and outputs a voicing decision 140 (as shown in Fig. 1). 

Those skilled in the art will appreciate that various adaptations and 
modifications of the just-described preferred embodiments can be configured without
departing from the scope and spirit of the invention. Therefore, it is to be understood 
that within the scope of the appended claims, the invention may be practiced other 
than as specifically described herein. 

CLAIMS

What Is Claimed Is:

1. In a speech communication system, a method for generating a frame
voicing decision comprising the steps of:

(a) extracting a predetermined set of parameters, including a pitch gain
and a pitch lag, from the incoming speech signal for each frame;
and

(b) making a frame voicing decision according to the extracted
predetermined set of parameters.

2. The method according to claim 1, wherein the predetermined set of
parameters further comprises a full band energy and line spectral frequencies (LSF).

3. A method according to claim 2, wherein the step of making a frame
voicing decision further comprises the steps of:

i. calculating a standard deviation σ of the pitch lag;
ii. calculating a long-term mean of pitch gain;
iii. calculating a short-term average of energy E, Es;
iv. calculating a short-term average of LSF, LSFs;
v. calculating an average energy Ē; and
vi. calculating an average LSF value, LSFn.

4. A method according to claim 3, wherein the step of making a frame
voicing decision further comprises the steps of:

i) calculating a spectral difference SD1 using a normalized
Itakura-Saito measure;
ii) calculating a spectral difference SD2 using a mean
square error method;
iii) calculating a spectral difference SD3 using a mean
square error method; and
iv) calculating a long-term mean of SD2.

5. A method according to claim 4, wherein an initial frame voicing
decision is made according to the calculated values.

6. A method according to claim 5, wherein the initial frame voicing
decision is smoothed.

7. A method according to claim 6, wherein an initialization routine is
performed for a predetermined number of initial frames, such that the voicing decision
is set to active voice.

8. A voice activity detector (VAD) for making a voicing decision on an
incoming speech signal frame, the VAD comprising:

an extractor for extracting a predetermined set of parameters,
including a pitch gain and a pitch lag, from the incoming speech signal
for each frame;

a calculator unit for calculating a set of predetermined values
based on the extracted predetermined set of parameters; and

a decision unit for making a frame voicing decision according
to the predetermined set of values.

9. The VAD according to claim 8, wherein the predetermined set of
parameters further comprises a full band energy and line spectral frequencies (LSF).

10. The VAD according to claim 9, wherein the calculator unit calculates:

a standard deviation σ of the pitch lag;
a long-term mean of pitch gain;
a short-term average of energy E, Es;
a short-term average of LSF, LSFs;
an average energy Ē; and
an average LSF value, LSFn.

11. The VAD according to claim 10, wherein the calculator unit further
calculates:

a spectral difference SD1 using a normalized Itakura-Saito
measure;
a spectral difference SD2 using a mean square error method;
a spectral difference SD3 using a mean square error method;
and
a long-term mean of SD2.

12. The VAD according to claim 11, wherein the decision unit makes an
initial frame voicing decision according to the values calculated by the calculation
means.

13. The VAD according to claim 12, wherein the initial frame voicing
decision is smoothed.

14. A voice activity detection method for detecting voice activity in an
incoming speech signal frame, the improvement comprising making a voicing
decision based on a pitch lag and a pitch gain of the speech signal frame.

15. The voice activity detection method of claim 14, further comprising
making the voicing decision based on a frame full band energy and a set of spectral
parameters called Line Spectral Frequencies (LSF).
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METHOD AND APPARATUS FOR 
DETECTING VOICE ACTIVITY IN A SPEECH SIGNAL 

5 BACKGROUND OF THE INVENTION 

1 . Field of the Invention 

The present invention relates generally to the field of speech codine in 
communication systems, and more particularly to detecting voice activity in a 
communications system. 

10 2. Description of Related Art 

Modem communication systems rely heavily on digital speech 
processing in general, and digital speech compression in particular, in order to provide 
efficient systems. Examples of such communication systems are digital telephony 
trunks, voice mail, voice annotation, answering machines, digital voice over data 
15 links, etc. 

A speech communication system is typically comprised of an encoder, 
a communication channel and a decoder. At one end of a communications link, the 
speech encoder converts a speech signal which has been digitized into a bit-stream. 
The bit-stream is transmitted over the communication channel (which can be a storage 
20 medium), and is converted again into a digitized speech signal by the decoder at the 
other end of the communications link. 

The ratio between the number of bits needed for the representation of 
the digitized speech signal and the number of bits in the bit-stream is the compression 
ratio. A compression ratio of 12 to 16 is presently achievable, while still maintaining 
25 a high quality reconstructed speech signal. 

A significant portion of normal speech is comprised of silence, up to an 
average of 60% during a two-way conversation. During silence, the speech input 
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l,se .eve, and cha.c,=ns„cs can vary cons.de.abW. f,o. a ,n.e, ,00. .0 a no,s> 
I ! a fa. .ov,n, ca. However. .OS, of .e no.se sources ca.. .ess .nfo^ra on 
1 peech s.,na, and hence a h.her co.press.on rario ,s aCeva.e dnr.ne ,he 
r„:p:::Ln.ero,,o„.ns — speech.,, .eaenoredas.^^^^^^^^^^^^ 

and silence or bacRgronnd noise wi„ be denoted as -non-acve-vo.ce . 

The above discussron leads >o ,he concep, of dua,.mode speech cod.ng 
3.Hen,es. which are usuall. also variable-rare coding schemes. The acrve-voice and 

„on.ac„ve vorce signals are coded dilferenrl, in order .0 improve ,he sysren, 
:;:enc..h„sprov,ai„grwodiffere„,modesofspeechcod,ng. Thedr^^^^^^^^ 

of *e inpu, signal ,ac.ive-voice or non-ac.ive-voice, are derermrne a . 

Le enrploved for ,he non-acive-voice signal uses less bns and resuhs ,n an 
r:ihigheraveragecompressio„ra,,o,ha„,hecod,ngsche.eemp.^for^^^^^ 
- • , The Classifier output is binary, and is commonly called a voicmg 



15 

("VAD"). 



. t;on nf a sneech communication system which 
A schematic representation ot a speecn cou 

.„,o.aV^OforaHghercompressionra.eisde.c.edi„Pi^.^^^^^^^^ 

^ .H.r 1 10 is the digitized incoming speech signal 105. For each trame 
20 speech encoder uu is uic ui^ . . , • • „ i/in which 

dUai— sp=echsigna,0,eVAD125p,ovides*=vo,c,„gdec,s.o„mw^^^ 

.led as a swircH ,45 berween .he ac,ive-vo,ce e,.oder ,20 and .he non . e o^ 
.er 1 ,5 Eiter *e acive-voice bi.-s.ream 135 or Ure „on-ac„ve-vo,ce b,.-s.rean, 
rerr.hevoicingdeci.onHOare— ed.hrongh^ 
,3 h!: ,50. M*=speechdecoder,55.hevo,ci„gdecision.s„sedin.es„..M. 
,0 selec. .he non-acive-voice decoder 165 or .he acve-voice decoder 170. 
f^e. ,he OU.PU, of eiiher decoders is used as .he recons^ced speech .75^ 

An example of a me.hod and appararus which empioys such a dual mod 

, J ■ 1 1 <; Patent No 5 774.849. commonly assighed to the present 
system is disclosed m U.S. Patem MO. s,' 
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assignee and herein incorporated by reference. According to U.S. Patent No. 5,774.849 
four parameters are disclosed which may be used to make the voicing decision. 
Specifically, the full band energy, the frame low-band energy, a set of parameters called 
Line Spectral Frequencies f'TSF") and the frame zero crossing rate are compared to a 
5 long-term average of the noise signal. While this algorithm provides satisfactory results 
for many applications, the present inventors have determined that a modified decision 
algorithm can provide improved performance over the prior an voicing decision 
algorithms. 

SUMMARY OF THF INVENTION 
10 A method and apparatus for generating frame voicing decisions for an 

incoming speech signal having periods of active voice and non-active voice for a 
speech encoder in a speech communications system. A predetermined set of 
parameters is extracted from the incoming speech signal, including a pitch gain and a 
pitch lag. A frame voicing decision is made for each frame of the incoming speech 
15 signal according to values calculated from the extracted parameters. The 

predetermined set of parameters funher includes a frame full band energy, and a set 
of spectral parameters called Line Spectral Frequencies (LSF). 

BRIEF DESCRIPTION OF THE DRAWINGS 
20 The exact nature of this invention, as well as its objects and 

advantages, will become readily apparent from consideration of the following 
specification as illustrated in the accompanying drawings, in which like reference 
numerals designate like pans throughout the figures thereof, and wherein: 

Figure 1 is a block diagram representation of a speech communication 
25 system using a VAD; 

Figures 2(A) and 2(B) are process flowcharts illustrating the 
operation of the VAD in accordance with the present invention; and 

Figure 3 is a block diagram illustrating one embodiment of a VAD 
according to the present invention. 



30 
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r>PTAn FD DFSCRIPTION 
OF THF PREFE^RFn FMBO OIMENTS 
The following description is provided to enable any person skilled in 
the an to make and use the invention and sets forth the best modes contemplated by 
the inventor for carrying out the invention. Various modifications, however, will 
remain readily apparent to those skilled m the art, since the basic principles of the 
present invention have been defined herein specifically to provide a voice activity 
detection method and apparams. 

In the following description, the present invention is described in temis 
of functional block diagrams and process fiow charts, which are the ordinary means 
for those skilled in the art of speech coding for describing the operation of a VAD. 
The presem invemion is not limited to any specific programming languages, or any 
specific hardware or software implememation. since those skilled in the art can readily 
detemiine the most suitable way of implementing the teachings of the presem 
15 invention. 

In the preferred embodiment, a Voice Activity Detection (VAD) 
n^odule is used to generate a voicing decision which switches between an active-voice 
encoder/decoder and a non-active-voice encoder/decoder. The binary voicing 
decision is either 1 (TRUE) for the active-voice or 0 (FALSE) for the non-active- 
20 voice. 

The VAD process flowchart is illustrated in Figures 2(A) and 2(B). 
The VAD operates on frames of digitized speech. The frames are processed in time 
order and are consecutively numbered from the beginning of each 
conversation/recording. The illustrated process is performed once per frame. 

At the first block 200. four parametric features are extracted from the 
input signal. Extraction of the parameters can be shared with the active-voice encoder 
module 120 and the non-active-voice encoder module 1 15 for computational 
efficiency. The parameters are the frame fuW band energy, a set of spectral parameters 
called Line Spectral Frequencies ("LSF"), the pitch gain and the pitch lag. A set of 



25 
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20 



linear prediction coefficienis is derived from the auio correlation and a set of 
fef", }'' 

'=1 is derived from the set of linear prediction coefficients, as described in 
ITU-T, Study Group 15 Coniribmion - Q. 12/15. Draft Recommendation G.729. June 
8. 1995, Version 5.0, or DIGITAL SPEECH - Coding for Low Bit Rate 
Communication Systems by A.M. Kondoz. John Wiley & Son, 1994. England. The 
full band energy E is the logarithm of the normalized first auto correlation coefficient 
RiO): 



E =10-log 



— R(0) 
.N . 



where is a predetermined normalization factor. 

10 The pitch gain is a measure of the periodicity of the input signal. The higher the pitch 
gain, the more periodic the signaL and therefore the greater the likelihood that the 
signal is a speech signal. The pitch lag is the fundamental frequency of the speech 
(active-voice) signal. 

After the parameters are extracted, the standard deviation a of the pitch 
15 lags of the last four previous frames are computed at block 205. The long-ierm mean 
of the pitch gain is updated with the average of the pilch gain from the last four 
frames at block 210. In the preferred embodiment, the long-term mean of the pitch 
gain is calculated according to the following formula: 



Pgain= 0.8*Pgain + 0.2 * [average of last four frames] 



The short-term average of energy, Es , is updated at block 215 by 
averaging the last three frames with the current frame energy. Similarly, the shon- 



term average of LSF vectors. LSFs . is updated at block 220 by averaging the last 
three LSF frame vectors with the current LSF frame vector extracted by the parameter 
25 extractor at block 200. If the standard deviation a is less than T, or the long-term 
mean of the pitch gain is greater than T,, then a flag Pn.^ is set to one. otherwise Pn,^. 
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equals zero at block 225. 

If a < T, OR P,.„ > T:. then = 1 , else = 0. 

^- .T-i7andT=0 7 At block 230. a minimum energy 
In the preferred embodiment. T, - i -2 and 1 , u. / . 

buffer is updated wuh the minimum energy value over the last 128 frames. In other 
words if the present energy level .s less than the minimum energy level determined 
over the last 128 frames, then the value of the buffer ,s updated, otherwise the buffer 
value is unchanged. 

If the frame count (i.e., current frame number) is less than a
predetermined frame count Ni at block 235, where Ni is 32 in the preferred
embodiment, an initialization routine is performed by blocks 240-255. At block 240,
the average energy Ē and the long-term average noise spectrum LSFN are calculated
over the last Ni frames. The average energy Ē is the average of the energy of the last
Ni frames. The initial value for Ē, calculated at block 240, is:

$$\bar{E} = \frac{1}{N_i} \sum E$$



The long-term average noise spectrum LSFN is the average of the LSF
vectors of the last Ni frames. At block 245, if the instantaneous energy E extracted at
block 200 is less than 15 dB, then the voicing decision is set to zero (block 255);
otherwise the voicing decision is set to one (block 250). The processing for the frame is
then completed and the next frame is processed, beginning with block 200.
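
The initialization routine of blocks 240-255 can be sketched as follows. The function names are hypothetical, and averaging the frame energies directly in the dB domain is an assumption; the text does not say whether the average is taken over dB values or over linear energies.

```python
def initial_averages(energies_db, lsf_vectors):
    """Block 240: average energy and average LSF vector over the
    initialization frames (both lists cover the same frames)."""
    n = len(energies_db)
    mean_energy = sum(energies_db) / n  # assumption: averaged in dB
    order = len(lsf_vectors[0])
    mean_lsf = [sum(v[i] for v in lsf_vectors) / n for i in range(order)]
    return mean_energy, mean_lsf


def init_voicing_decision(energy_db, threshold_db=15.0):
    """Blocks 245-255: 0 below the 15 dB threshold, 1 otherwise."""
    return 0 if energy_db < threshold_db else 1
```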

The initialization processing of blocks 240-255 initializes the
processing over the last few frames. It is not critical to the operation of the present
invention and may be skipped. The calculations of block 240 are required, however,
for the proper operation of the invention and should be performed, even if the voicing
decisions of blocks 245-255 are skipped. Also, during initialization, the voicing
decision could always be set to "1" without significantly impacting the performance of
the present invention.

If the frame count is not less than Ni at block 235, then the first time
through block 260 (Frame_Count = Ni), the long-term average noise energy EN is
initialized by subtracting 12 dB from the average energy Ē:

EN = Ē - 12 dB

Next, at block 265, a spectral difference value SD1 is calculated using
the normalized Itakura-Saito measure. The value SD1 is a measure of the difference
between two spectra (the current frame spectrum, represented by R and Err, and the
background noise spectrum, represented by a). The Itakura-Saito measure is a well-known
algorithm in the speech processing art and is described in detail, for example,
in Discrete-Time Processing of Speech Signals, Deller, John R., Proakis, John G. and
Hansen, John H.L., 1987, pages 327-329, herein incorporated by reference.
Specifically, SD1 is defined by the following equation:

$$SD_1 = \frac{a^{T} R \, a}{E_{rr}}$$

where Err is the prediction error from linear prediction (LP) analysis of
the current frame;

R is the auto-correlation matrix from the LP analysis of the current
frame; and

a is a linear prediction filter describing the background noise
obtained from LSFN.

At block 270, the spectral differences SD2 and SD3 are calculated using
a mean square error method according to the following equations:

$$SD_2 = \sum_{i=1}^{p} \left[ LSF_s(i) - LSF_N(i) \right]^2$$

$$SD_3 = \sum_{i=1}^{p} \left[ LSF_s(i) - LSF(i) \right]^2$$

where LSFs is the short-term average of LSF;

LSFN is the long-term average noise spectrum; and

LSF is the current LSF extracted by the parameter extraction.
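
The spectral differences of blocks 265 and 270 can be illustrated with the following Python sketch. It rests on several assumptions that are not confirmed by the text: SD1 is taken to be the ratio of the quadratic form a'Ra to the LP prediction error of the current frame, R is taken to be the Toeplitz matrix built from the frame's autocorrelation sequence, the prediction error is obtained with a textbook Levinson-Durbin recursion, and SD3 is taken to compare the short-term average LSF vector with the current LSF vector. All function names are hypothetical.

```python
def levinson_durbin(r, order):
    """Textbook Levinson-Durbin recursion on autocorrelation r[0..order].
    Returns (a, err) with a = [1, a1, ..., ap] the prediction-error filter."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = sum(a[k] * r[m - k] for k in range(m))
        k_m = -acc / err  # reflection coefficient
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + k_m * prev[m - k]
        a[m] = k_m
        err *= 1.0 - k_m * k_m
    return a, err


def spectral_difference_sd1(r, noise_filter):
    """Assumed form of SD1: (a' R a) / Err, where r is the current frame's
    autocorrelation and noise_filter = [1, a1, ..., ap] models the noise."""
    order = len(noise_filter) - 1
    _, err = levinson_durbin(r, order)
    quad = sum(noise_filter[i] * noise_filter[j] * r[abs(i - j)]
               for i in range(order + 1) for j in range(order + 1))
    return quad / err


def squared_lsf_distance(x, y):
    """Sum of squared differences between two LSF vectors (SD2 or SD3)."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))
```

Under these assumptions SD1 equals 1 when the noise filter matches the current frame's own LP filter and grows as the two spectra diverge. SD2 would be `squared_lsf_distance(lsf_s, lsf_n)` and SD3 would be `squared_lsf_distance(lsf_s, lsf)`.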



The long-term mean of SD2 (sm_SD2) in the preferred embodiment is
updated at block 275 according to the following equation:

sm_SD2 = 0.4 * SD2 + 0.6 * sm_SD2

Thus, the long-term mean of SD2 is a linear combination of the past long-term mean
and the current SD2 value.
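
The update of block 275 is a first-order recursive average. A short Python sketch, with a hypothetical function name, shows how the long-term mean approaches a constant SD2 value over successive frames:

```python
def update_sm_sd2(sm_sd2, sd2):
    """Block 275: blend the current SD2 (weight 0.4) with the past mean (0.6)."""
    return 0.4 * sd2 + 0.6 * sm_sd2


def run_smoothing(sd2_values, initial=0.0):
    """Apply the update frame by frame and return the trajectory.
    The initial value of 0.0 is an assumption for illustration."""
    trajectory = []
    mean = initial
    for sd2 in sd2_values:
        mean = update_sm_sd2(mean, sd2)
        trajectory.append(mean)
    return trajectory
```

With a constant SD2 of 1.0 and a starting mean of 0.0, the first two values of the trajectory are 0.4 and 0.64.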

The initial voicing decision, obtained in block 280, is denoted by IVD.
The value of IVD is determined according to the following decision statements:

If Es > EN + X1 dB
OR
E > EN + X2 dB
then IVD = 1;

If Es - EN < X3 dB
AND
sm_SD2 < T3
AND
Frame_Count > 128
then IVD = 0; else IVD = 1;

If E > 1/2 (E^-1 + E^-2) + X4 dB
OR
SD1 > 1.5
then IVD = 1.

Here E^-1 and E^-2 denote the energies of the previous frame and of the frame before it.
In the preferred embodiment, X1 = 1, X2 = 3, X3 = 2, X4 = 7, and T3 = 0.00012.
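
The three decision statements of block 280 can be sketched in Python as shown below. The function and argument names are hypothetical, the default constants are the preferred-embodiment values read in order from the text, and the statements are applied literally in the order printed. Note that, read this way, the second statement always assigns a value and therefore overrides the result of the first.

```python
def initial_voicing_decision(e, e_s, e_n, e_prev1, e_prev2,
                             sm_sd2, sd1, frame_count,
                             x1=1.0, x2=3.0, x3=2.0, x4=7.0, t3=0.00012):
    """Return IVD (1 = active voice, 0 = non-active voice).

    e: instantaneous energy, e_s: short-term average energy,
    e_n: long-term noise energy, e_prev1/e_prev2: energies of the two
    previous frames (all in dB)."""
    ivd = 0
    # Statement 1: energy clearly above the noise floor.
    if e_s > e_n + x1 or e > e_n + x2:
        ivd = 1
    # Statement 2: close to the noise floor and spectrally close to noise.
    if e_s - e_n < x3 and sm_sd2 < t3 and frame_count > 128:
        ivd = 0
    else:
        ivd = 1
    # Statement 3: sudden energy rise or large spectral difference.
    if e > 0.5 * (e_prev1 + e_prev2) + x4 or sd1 > 1.5:
        ivd = 1
    return ivd
```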

The initial voicing decision is smoothed at block 285 to reflect the long-term
stationary nature of the speech signal. The smoothed voicing decision of the
frame, the previous frame and the frame before the previous frame are denoted by
S_VD^0, S_VD^-1 and S_VD^-2, respectively. Both S_VD^-1 and S_VD^-2 are initialized
to 1 and S_VD^0 = IVD. A Boolean parameter F_VD^-1 is initialized to 1 and a counter
denoted by Ce is initialized to 0. The energy of the previous frame is denoted by E^-1.
Thus, the smoothing stage is defined by:

if F_VD^-1 = 1 and IVD = 0 and S_VD^-1 = 1 and [...]
{
    [...]
    Ce = Ce + 1
    if Ce <= T4 {
        [...]
    }
    else {
        F_VD^-1 = 0
        Ce = 0
    }
}
[...]
F_VD^-1 = 1

Ce is reset to 0 if S_VD^-1 = 1 and S_VD^-2 = 1 and IVD = 1.

If Pflag = 1, then S_VD^0 = 1

If E < 15 dB, then S_VD^0 = 0

In the preferred embodiment, T4 = 14. The final value of S_VD^0 represents the final
voicing decision, with a value of "1" representing an active voice speech signal, and a
value of "0" representing a non-active voice speech signal.

FSD is a flag which indicates whether consecutive frames exhibit
spectral stationarity (i.e., the spectrum does not change dramatically from frame to
frame). FSD is set at block 290 according to the following, where Cs is a counter
initialized to 0.

If Frame_Count > 128 AND SD3 < T5
then
    Cs = Cs + 1
else
    Cs = 0;
If Cs > N
    FSD = 1
else
    FSD = 0.

In the preferred embodiment, T5 = 0.0005 and N = 20.
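
The stationarity flag of block 290 amounts to a run-length counter. A minimal Python sketch follows; the function name and the convention of returning the updated counter alongside the flag are illustrative choices, not part of the text.

```python
def update_stationarity(cs, sd3, frame_count, t5=0.0005, n=20):
    """Block 290: return (updated counter Cs, flag FSD).

    The counter grows while SD3 stays below T5 (after frame 128) and is
    reset otherwise; the flag is raised once the counter exceeds N."""
    if frame_count > 128 and sd3 < t5:
        cs += 1
    else:
        cs = 0
    fsd = 1 if cs > n else 0
    return cs, fsd
```

With the preferred-embodiment values, the flag is first raised on the 21st consecutive frame that satisfies the condition, and a single non-stationary frame clears it.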

The running averages of the background noise characteristics are
updated at the last stage of the VAD algorithm. At blocks 295 and 300, the following
conditions are tested and the updating takes place only if these conditions are met:

If Es < EN + 3 AND Pflag = 0
then EN = βEN * EN + (1 - βEN) * [max of E AND Es]
AND
LSFN(i) = βLSF * LSFN(i) + (1 - βLSF) * LSF(i), i = 1, ..., p

If Frame_Count > 128 AND
EN < Min AND FSD = 1 AND Pflag = 0
then
EN = Min
else If Frame_Count > 128 AND EN > Min + 10
then
EN = Min.
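
The background noise update of blocks 295 and 300 can be sketched in Python as follows. The function and argument names are hypothetical. The smoothing factors for the noise energy and for the noise LSF vector are not given numerically in this passage, so they are required arguments here, and any values used with this sketch are illustrative only.

```python
def update_noise(e, e_s, e_n, lsf, lsf_n, pflag, fsd,
                 frame_count, e_min, beta_en, beta_lsf):
    """Return the updated (noise energy, noise LSF vector).

    e_min is the value held by the minimum energy buffer; beta_en and
    beta_lsf are smoothing factors whose values are assumptions."""
    # Update only when the frame is near the noise floor and unvoiced.
    if e_s < e_n + 3 and pflag == 0:
        e_n = beta_en * e_n + (1 - beta_en) * max(e, e_s)
        lsf_n = [beta_lsf * n + (1 - beta_lsf) * c
                 for n, c in zip(lsf_n, lsf)]
    # Pull the noise energy back to the tracked minimum when it drifts.
    if frame_count > 128 and e_n < e_min and fsd == 1 and pflag == 0:
        e_n = e_min
    elif frame_count > 128 and e_n > e_min + 10:
        e_n = e_min
    return e_n, lsf_n
```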

Figure 3 illustrates a block diagram of one possible implementation of
a VAD 400 according to the present invention. An extractor 402 extracts the required
predetermined parameters, including a pitch lag and a pitch gain, from the incoming
speech signal 105. A calculator unit 404 performs the necessary calculations on the
extracted parameters, as illustrated by the flowcharts in Figs. 2(A) and 2(B). A
decision unit 406 then determines whether a current speech frame is an active voice or
a non-active voice signal and outputs a voicing decision 140 (as shown in Fig. 1).

Those skilled in the art will appreciate that various adaptations and
modifications of the just-described preferred embodiments can be configured without
departing from the scope and spirit of the invention. Therefore, it is to be understood
that, within the scope of the appended claims, the invention may be practiced other
than as specifically described herein.

CLAIMS

What Is Claimed Is:

1. In a speech communication system, a method for generating a frame
voicing decision comprising the steps of:
(a) extracting a predetermined set of parameters, including a pitch gain
and a pitch lag, from the incoming speech signal for each frame;
and
(b) making a frame voicing decision according to the extracted
predetermined set of parameters.

2. The method according to claim 1, wherein the predetermined set of
parameters further comprises a full band energy and line spectral frequencies (LSF).

3. A method according to claim 2, wherein the step of making a frame
voicing decision further comprises the steps of:
i. calculating a standard deviation σ of the pitch lag;
ii. calculating a long-term mean of pitch gain;
iii. calculating a short-term average of energy E, Es;
iv. calculating a short-term average of LSF, LSFs;
v. calculating an average energy Ē; and
vi. calculating an average LSF value, LSFN.

4. A method according to claim 3, wherein the step of making a frame
voicing decision further comprises the steps of:
i) calculating a spectral difference SD1 using a normalized
Itakura-Saito measure;
ii) calculating a spectral difference SD2 using a mean
square error method;
iii) calculating a spectral difference SD3 using a mean
square error method; and
iv) calculating a long-term mean of SD2.

5. A method according to claim 4, wherein an initial frame voicing
decision is made according to the calculated values.

6. A method according to claim 5, wherein the initial frame voicing
decision is smoothed.

7. A method according to claim 6, wherein an initialization routine is
performed for a predetermined number of initial frames, such that the voicing decision
is set to active voice.

8. A voice activity detector (VAD) for making a voicing decision on an
incoming speech signal frame, the VAD comprising:
an extractor for extracting a predetermined set of parameters,
including a pitch gain and a pitch lag, from the incoming speech signal
for each frame;
a calculator unit for calculating a set of predetermined values
based on the extracted predetermined set of parameters; and
a decision unit for making a frame voicing decision according
to the predetermined set of values.

9. The VAD according to claim 8, wherein the predetermined set of
parameters further comprises a full band energy and line spectral frequencies (LSF).

10. The VAD according to claim 9, wherein the calculator unit calculates:
a standard deviation σ of the pitch lag;
a long-term mean of pitch gain;
a short-term average of energy E, Es;
a short-term average of LSF, LSFs;
an average energy Ē; and
an average LSF value, LSFN.

11. The VAD according to claim 10, wherein the calculator unit further
calculates:
a spectral difference SD1 using a normalized Itakura-Saito
measure;
a spectral difference SD2 using a mean square error method;
a spectral difference SD3 using a mean square error method;
and
a long-term mean of SD2.

12. The VAD according to claim 11, wherein the decision unit makes an
initial frame voicing decision according to the values calculated by the calculation
means.

13. The VAD according to claim 12, wherein the initial frame voicing
decision is smoothed.

14. A voice activity detection method for detecting voice activity in an
incoming speech signal frame, the improvement comprising making a voicing
decision based on a pitch lag and a pitch gain of the speech signal frame.

15. The voice activity detection method of claim 14, further comprising
making the voicing decision based on a frame full band energy and a set of spectral
parameters called Line Spectral Frequencies (LSF).